[Figure panels: optimization of the real-valued architecture $\alpha$ (objective $f(\alpha)$, optima $\alpha_1^*, \alpha_2^*$) and of the binary architecture $\hat{\alpha}$ (objective $\hat{f}(\hat{\alpha})$, optima $\hat{\alpha}_1^*, \hat{\alpha}_2^*$), contrasting the optimized solution with the sub-optimized solution obtained by direct binarization.]
FIGURE 4.10
Motivation for DCP-NAS. We first show that directly binarizing a real-valued architecture to 1-bit is sub-optimal. We therefore use tangent propagation (middle) to find an optimized 1-bit neural architecture along the tangent direction, which leads to a better-performing 1-bit neural architecture.
4.4 DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-Bit CNNs
As discussed for CP-NAS above, real-valued models converge much faster than 1-bit models, as revealed in [157]. This motivates us to use the tangent direction of the Parent supernet (real-valued model) as an indicator of the optimization direction for the Child supernet (1-bit model). We assume that all possible 1-bit neural architectures can be learned from the tangent space of the Parent model, based on which we introduce the Discrepant Child-Parent Neural Architecture Search (DCP-NAS) [135] method to produce an optimized 1-bit CNN. Specifically, as shown in Fig. 4.10, instead of directly binarizing the Parent to obtain the Child, we use the Parent model to find a tangent direction and learn the 1-bit Child through tangent propagation. Since the tangent direction involves second-order information, we further accelerate the search with the generalized Gauss-Newton (GGN) matrix, leading to an efficient search process. Furthermore, a coupling relationship exists between the weights and the architecture parameters in such DARTS-based [151] methods, leading to asynchronous convergence and an insufficient training process. To overcome this obstacle, we propose a decoupled optimization for training the Child-Parent model, leading to an effective and optimized search process. The overall framework of our DCP-NAS is shown in Fig. 4.11.
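To make the role of the tangent direction concrete, the following is a heavily simplified sketch, not the objective derived later in this section: here the tangent direction is taken to be the gradient of each supernet's loss with respect to its architecture parameters, and the quadratic placeholder losses, the penalty weight lam, and the tensor shapes are all assumptions for illustration.

```python
import torch

# Toy stand-in for the Parent/Child supernets: E edges, each choosing among
# NUM_OPS candidate operations; alpha / alpha_hat are the real-valued and
# 1-bit architecture parameters. The two quadratic "losses" below are
# placeholders for the actual supernet validation losses.
torch.manual_seed(0)
E, NUM_OPS = 4, 3
alpha = torch.zeros(E, NUM_OPS, requires_grad=True)      # Parent architecture
alpha_hat = torch.zeros(E, NUM_OPS, requires_grad=True)  # Child architecture

parent_target = torch.randn(E, NUM_OPS)
child_target = torch.randn(E, NUM_OPS)

def parent_loss(a):  # placeholder Parent (real-valued) objective
    return ((torch.softmax(a, -1) - torch.softmax(parent_target, -1)) ** 2).sum()

def child_loss(a):   # placeholder Child (1-bit) objective
    return ((torch.softmax(a, -1) - torch.softmax(child_target, -1)) ** 2).sum()

lam = 0.1  # assumed weight on the tangent-discrepancy term
opt = torch.optim.SGD([alpha, alpha_hat], lr=0.1)

for step in range(200):
    opt.zero_grad()
    l_parent = parent_loss(alpha)
    # Tangent direction of the Parent w.r.t. its architecture parameters.
    g_parent = torch.autograd.grad(l_parent, alpha, retain_graph=True)[0]
    # Tangent direction of the Child; create_graph=True exposes the
    # second-order information whose cost the GGN approximation later reduces.
    g_child = torch.autograd.grad(child_loss(alpha_hat), alpha_hat,
                                  create_graph=True)[0]
    # Child loss plus a penalty that aligns the Child's tangent direction with
    # the Parent's, instead of copying (binarizing) alpha directly.
    loss = l_parent + child_loss(alpha_hat) + lam * ((g_child - g_parent) ** 2).sum()
    loss.backward()
    opt.step()
```

Aligning the Child's tangent direction with the Parent's, rather than copying $\alpha$ and binarizing it, is what distinguishes the search from direct binarization; the create_graph=True call is where the second-order cost enters, which motivates the GGN acceleration discussed later.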
4.4.1 Preliminary
Neural architecture search. Given a conventional CNN model, we denote its weights and feature maps in a specific layer as $w \in \mathcal{W}$, with $\mathcal{W} = \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$, and $a_{in} \in \mathbb{R}^{C_{in} \times W \times H}$. $C_{out}$ and $C_{in}$ represent the output and input channels of the layer, $(W, H)$ are the width and height of the feature maps, and $K$ is the kernel size. Then we have
$$
a_{out} = a_{in} \otimes w, \tag{4.19}
$$
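As a quick sanity check of the shapes in Eq. (4.19), the following minimal sketch runs one such convolution in PyTorch; the concrete layer sizes are arbitrary, and F.conv2d stands in for the $\otimes$ operator.

```python
import torch
import torch.nn.functional as F

# Shapes follow Eq. (4.19): w in R^{C_out x C_in x K x K}, a_in in R^{C_in x W x H}.
# The sizes below are arbitrary examples (note PyTorch orders spatial dims as H, W).
C_in, C_out, K, W, H = 16, 32, 3, 28, 28
w = torch.randn(C_out, C_in, K, K)
a_in = torch.randn(C_in, W, H)

# F.conv2d expects a batch dimension, so the single feature map is unsqueezed;
# padding=K//2 preserves the spatial resolution.
a_out = F.conv2d(a_in.unsqueeze(0), w, padding=K // 2).squeeze(0)
print(a_out.shape)  # torch.Size([32, 28, 28]) -> R^{C_out x W x H}
```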